Reinforcement Learning

Key

  • states $s_i$
  • actions $a_i$
  • rewards $R$
  • discount factor $\gamma \in [0,1]$: specifies the relevance in the present of future rewards.
  • return $G$: cumulative (discounted) reward
  • policy $\pi$: a specification of how the agent acts. $\pi(a \mid s)$ gives the probability of taking action $a$ when in state $s$.

The goal of reinforcement learning is to choose a policy $\pi(s) = a$ that tells us which action to take in each state $s$ so as to maximize the expected return $G$.
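
To make the return concrete, here is a minimal Python sketch of the discounted return $G$ for a finite episode (the reward sequence and $\gamma$ value are just illustrative, not from these notes):

```python
# Minimal sketch: computing the discounted return G from a finite sequence of rewards.
# The reward list and gamma value are illustrative placeholders.

def discounted_return(rewards, gamma=0.9):
    """Return G = R_0 + gamma*R_1 + gamma^2*R_2 + ... until termination."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0 + 0 + 0.729*10 = 8.29
```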

RL Overview

Pasted image 20231011155126.png|400

Model-based vs. Model-free RL

Pasted image 20230425111911.png|400

Value-based vs. Policy-based RL

These approaches differ in the type of algorithm used to derive the optimal behavior: value-based methods learn a value function and act greedily with respect to it, while policy-based methods optimize the policy directly.

Off-policy vs. On-policy methods

These methods differ in whether the agent learns from experience generated by the policy currently being updated (on-policy) or from experience generated by a different behavior policy (off-policy).

Q-Learning

Q-learning is a model-free, value-based, and off-policy type of reinforcement learning.

Concepts

How it works

  1. the agent selects the action with the highest Q-value in each state
  2. after each step, the policy is evaluated and the Q-values are updated using the Bellman equation: $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$ (see the sketch after this list)
  3. the process is repeated until the policy converges and selects the same action at a given state

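A minimal tabular sketch of this loop, on a made-up 1-D chain environment (the environment, reward structure, and hyperparameters are illustrative assumptions, not from these notes):

```python
import numpy as np

# Minimal tabular Q-learning sketch on a toy 1-D chain environment
# (the environment and hyperparameters are made up for illustration).
n_states, n_actions = 6, 2              # states 0..5; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Move left/right along the chain; reaching the last state gives reward 1 and ends the episode."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

def choose_action(state):
    # epsilon-greedy: usually the highest Q-value (ties broken randomly), sometimes a random action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(np.random.choice(best))

for episode in range(500):
    state = 0
    for t in range(100):                # step cap keeps early (untrained) episodes finite
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # Bellman update toward the bootstrapped target r + gamma * max_a' Q(s', a')
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if done:
            break

print(np.argmax(Q, axis=1))  # greedy action per state; non-terminal states should prefer 1 ("right")
```
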
Algorithms

/assets/images/reinforcement-learning-1.png|400

Upper Confidence Bound

UCB is a deterministic algorithm for reinforcement learning that balances exploration and exploitation using a confidence bound that the algorithm assigns to each machine on each round. This bound shrinks the more often a machine is selected relative to the other machines.

Example: determining which ads customers will click on.

How it works

  1. At each round $n$, we consider two numbers for machine $m$:
    • $N_m(n)$ = number of times machine $m$ was selected up to round $n$.
    • $R_m(n)$ = sum of rewards of machine $m$ up to round $n$.
  2. From these two numbers we calculate:
    • the average reward of machine $m$ up to round $n$, $\bar{r}_m(n) = R_m(n) / N_m(n)$
    • the confidence interval $[\bar{r}_m(n) - \Delta_m(n),\ \bar{r}_m(n) + \Delta_m(n)]$ at round $n$, with $$\Delta_m(n) = \sqrt{\frac{1.5 \, \log(n)}{N_m(n)}}$$
  3. We select the machine $m$ that has the maximum $\mathrm{UCB}_m(n) = \bar{r}_m(n) + \Delta_m(n)$ (see the sketch after this list).
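
A minimal sketch of these steps on simulated binary ad-click rewards (the click-through probabilities below are made up for illustration):

```python
import math
import random

# Minimal UCB sketch on simulated binary ad clicks.
# The click-through probabilities are illustrative placeholders.
click_prob = [0.05, 0.10, 0.25, 0.15]   # one "machine"/ad per entry
n_ads = len(click_prob)
N = [0] * n_ads                          # times each ad was selected
R = [0.0] * n_ads                        # sum of rewards per ad

for n in range(1, 10001):
    # Compute UCB = average reward + confidence bound for each ad;
    # ads never tried yet get an infinite bound so they are tried first.
    ucb = []
    for m in range(n_ads):
        if N[m] == 0:
            ucb.append(float("inf"))
        else:
            avg = R[m] / N[m]
            delta = math.sqrt(1.5 * math.log(n) / N[m])
            ucb.append(avg + delta)
    chosen = max(range(n_ads), key=lambda m: ucb[m])

    reward = 1.0 if random.random() < click_prob[chosen] else 0.0
    N[chosen] += 1
    R[chosen] += reward

print("selections per ad:", N)   # the best ad (index 2) should dominate
```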

Intuitive explanation
https://medium.com/analytics-vidhya/upper-confidence-bound-for-multi-armed-bandits-problem-ea5d7bc840ff

Thompson Sampling (Bayesian Bandits)

How it works

  1. At each round $n$, we consider two numbers for ad $i$:
    • $N_i^1(n)$ = number of times ad $i$ got reward 1 up to round $n$.
    • $N_i^0(n)$ = number of times ad $i$ got reward 0 up to round $n$.
  2. For each ad $i$, we take a random draw $\theta_i(n)$ from the beta distribution below: $$\theta_i(n) \sim \mathrm{Beta}\big(N_i^1(n) + 1,\ N_i^0(n) + 1\big)$$
  3. We select the ad $i$ that has the highest $\theta_i(n)$ (see the sketch after this list).
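
A minimal sketch of Thompson sampling on the same kind of simulated ad-click setup as the UCB example above (the click-through probabilities are again illustrative assumptions):

```python
import random

# Minimal Thompson sampling (Bayesian bandit) sketch on simulated ad clicks.
# The click-through probabilities are illustrative placeholders.
click_prob = [0.05, 0.10, 0.25, 0.15]
n_ads = len(click_prob)
n_rewards_1 = [0] * n_ads      # times each ad got reward 1
n_rewards_0 = [0] * n_ads      # times each ad got reward 0

for n in range(10000):
    # Draw theta_i ~ Beta(N_i^1 + 1, N_i^0 + 1) for each ad and pick the largest draw
    theta = [random.betavariate(n_rewards_1[i] + 1, n_rewards_0[i] + 1) for i in range(n_ads)]
    chosen = max(range(n_ads), key=lambda i: theta[i])

    reward = 1 if random.random() < click_prob[chosen] else 0
    if reward == 1:
        n_rewards_1[chosen] += 1
    else:
        n_rewards_0[chosen] += 1

print("times each ad got a click:", n_rewards_1)  # ad 2 should collect most clicks
```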

Markov decision process

The core reinforcement learning approaches define the world as a Markov decision process, which is built on hidden dynamics and optimal control.

$$G = R_i + \gamma R_{i+1} + \gamma^2 R_{i+2} + \gamma^3 R_{i+3} + \dots$$

...until termination.